Import Libraries & Set Up¶


In [1]:
import warnings
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
In [2]:
warnings.filterwarnings('ignore')

palette = ['#800080', '#8A2BE2', '#FF69B4', '#DA70D6', '#9370DB', '#DDA0DD', '#BA55D3']
gradient_palette = sns.light_palette('#620080', as_cmap=True)
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=palette)
sns.set_theme(style="whitegrid", palette=palette)

Dementia Dataset¶


Import Dataset & Examine¶


Import Dataset¶


In [3]:
dementia_df = pd.read_csv('data/dementia_data-MRI-features.csv')

Dataset Info & Structure¶


In [4]:
print(dementia_df.shape)
(373, 15)
In [5]:
print(dementia_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Subject ID  373 non-null    object 
 1   MRI ID      373 non-null    object 
 2   Group       373 non-null    object 
 3   Visit       373 non-null    int64  
 4   MR Delay    373 non-null    int64  
 5   M/F         373 non-null    object 
 6   Hand        373 non-null    object 
 7   Age         373 non-null    int64  
 8   EDUC        373 non-null    int64  
 9   SES         354 non-null    float64
 10  MMSE        371 non-null    float64
 11  CDR         373 non-null    float64
 12  eTIV        373 non-null    int64  
 13  nWBV        373 non-null    float64
 14  ASF         373 non-null    float64
dtypes: float64(5), int64(5), object(5)
memory usage: 43.8+ KB
None
In [6]:
dementia_df.head()
Out[6]:
Subject ID MRI ID Group Visit MR Delay M/F Hand Age EDUC SES MMSE CDR eTIV nWBV ASF
0 OAS2_0001 OAS2_0001_MR1 Nondemented 1 0 M R 87 14 2.0 27.0 0.0 1987 0.696 0.883
1 OAS2_0001 OAS2_0001_MR2 Nondemented 2 457 M R 88 14 2.0 30.0 0.0 2004 0.681 0.876
2 OAS2_0002 OAS2_0002_MR1 Demented 1 0 M R 75 12 NaN 23.0 0.5 1678 0.736 1.046
3 OAS2_0002 OAS2_0002_MR2 Demented 2 560 M R 76 12 NaN 28.0 0.5 1738 0.713 1.010
4 OAS2_0002 OAS2_0002_MR3 Demented 3 1895 M R 80 12 NaN 22.0 0.5 1698 0.701 1.034

Statistical Summary¶


In [7]:
dementia_df.describe().T
Out[7]:
count mean std min 25% 50% 75% max
Visit 373.0 1.882038 0.922843 1.000 1.000 2.000 2.000 5.000
MR Delay 373.0 595.104558 635.485118 0.000 0.000 552.000 873.000 2639.000
Age 373.0 77.013405 7.640957 60.000 71.000 77.000 82.000 98.000
EDUC 373.0 14.597855 2.876339 6.000 12.000 15.000 16.000 23.000
SES 354.0 2.460452 1.134005 1.000 2.000 2.000 3.000 5.000
MMSE 371.0 27.342318 3.683244 4.000 27.000 29.000 30.000 30.000
CDR 373.0 0.290885 0.374557 0.000 0.000 0.000 0.500 2.000
eTIV 373.0 1488.128686 176.139286 1106.000 1357.000 1470.000 1597.000 2004.000
nWBV 373.0 0.729568 0.037135 0.644 0.700 0.729 0.756 0.837
ASF 373.0 1.195461 0.138092 0.876 1.099 1.194 1.293 1.587
In [8]:
print(f"Number of unique subjects: {len(dementia_df['Subject ID'].unique())}")
Number of unique subjects: 150

Preparing the Data¶


Target Examination¶


In [9]:
sns.countplot(x=dementia_df['Group'], palette=palette)
Out[9]:
<Axes: xlabel='Group', ylabel='count'>
[Figure: count of records per Group (Nondemented, Demented, Converted)]
In [10]:
dementia_df.Group.value_counts()
Out[10]:
Group
Nondemented    190
Demented       146
Converted       37
Name: count, dtype: int64

The converted category consists of 37 records for 14 subjects.

In [11]:
dementia_df.loc[dementia_df.Group == 'Converted']
Out[11]:
Subject ID MRI ID Group Visit MR Delay M/F Hand Age EDUC SES MMSE CDR eTIV nWBV ASF
33 OAS2_0018 OAS2_0018_MR1 Converted 1 0 F R 87 14 1.0 30.0 0.0 1406 0.715 1.248
34 OAS2_0018 OAS2_0018_MR3 Converted 3 489 F R 88 14 1.0 29.0 0.0 1398 0.713 1.255
35 OAS2_0018 OAS2_0018_MR4 Converted 4 1933 F R 92 14 1.0 27.0 0.5 1423 0.696 1.234
36 OAS2_0020 OAS2_0020_MR1 Converted 1 0 M R 80 20 1.0 29.0 0.0 1587 0.693 1.106
37 OAS2_0020 OAS2_0020_MR2 Converted 2 756 M R 82 20 1.0 28.0 0.5 1606 0.677 1.093
38 OAS2_0020 OAS2_0020_MR3 Converted 3 1563 M R 84 20 1.0 26.0 0.5 1597 0.666 1.099
57 OAS2_0031 OAS2_0031_MR1 Converted 1 0 F R 86 12 3.0 30.0 0.0 1430 0.718 1.227
58 OAS2_0031 OAS2_0031_MR2 Converted 2 446 F R 88 12 3.0 30.0 0.0 1445 0.719 1.215
59 OAS2_0031 OAS2_0031_MR3 Converted 3 1588 F R 91 12 3.0 28.0 0.5 1463 0.696 1.199
81 OAS2_0041 OAS2_0041_MR1 Converted 1 0 F R 71 16 1.0 27.0 0.0 1289 0.771 1.362
82 OAS2_0041 OAS2_0041_MR2 Converted 2 756 F R 73 16 1.0 28.0 0.0 1295 0.768 1.356
83 OAS2_0041 OAS2_0041_MR3 Converted 3 1331 F R 75 16 1.0 28.0 0.5 1314 0.760 1.335
114 OAS2_0054 OAS2_0054_MR1 Converted 1 0 F R 85 18 1.0 29.0 0.0 1264 0.701 1.388
115 OAS2_0054 OAS2_0054_MR2 Converted 2 846 F R 87 18 1.0 24.0 0.5 1275 0.683 1.376
194 OAS2_0092 OAS2_0092_MR1 Converted 1 0 F R 83 12 2.0 28.0 0.0 1383 0.748 1.269
195 OAS2_0092 OAS2_0092_MR2 Converted 2 706 F R 84 12 2.0 27.0 0.5 1390 0.728 1.263
218 OAS2_0103 OAS2_0103_MR1 Converted 1 0 F R 69 16 1.0 30.0 0.0 1404 0.750 1.250
219 OAS2_0103 OAS2_0103_MR2 Converted 2 1554 F R 74 16 1.0 30.0 0.5 1423 0.722 1.233
220 OAS2_0103 OAS2_0103_MR3 Converted 3 2002 F R 75 16 1.0 30.0 0.5 1419 0.731 1.236
245 OAS2_0118 OAS2_0118_MR1 Converted 1 0 F R 67 14 4.0 30.0 0.0 1508 0.794 1.164
246 OAS2_0118 OAS2_0118_MR2 Converted 2 1422 F R 71 14 4.0 26.0 0.5 1529 0.788 1.147
261 OAS2_0127 OAS2_0127_MR1 Converted 1 0 M R 79 18 1.0 29.0 0.0 1644 0.729 1.067
262 OAS2_0127 OAS2_0127_MR2 Converted 2 851 M R 81 18 1.0 29.0 0.5 1654 0.720 1.061
263 OAS2_0127 OAS2_0127_MR3 Converted 3 1042 M R 81 18 1.0 29.0 0.5 1647 0.717 1.066
264 OAS2_0127 OAS2_0127_MR4 Converted 4 2153 M R 84 18 1.0 29.0 0.5 1668 0.694 1.052
265 OAS2_0127 OAS2_0127_MR5 Converted 5 2639 M R 86 18 1.0 30.0 0.5 1670 0.669 1.051
271 OAS2_0131 OAS2_0131_MR1 Converted 1 0 F R 65 12 2.0 30.0 0.5 1340 0.754 1.309
272 OAS2_0131 OAS2_0131_MR2 Converted 2 679 F R 67 12 2.0 25.0 0.0 1331 0.761 1.318
273 OAS2_0133 OAS2_0133_MR1 Converted 1 0 F R 78 12 3.0 29.0 0.0 1475 0.731 1.190
274 OAS2_0133 OAS2_0133_MR3 Converted 3 1006 F R 81 12 3.0 28.0 0.5 1495 0.687 1.174
295 OAS2_0144 OAS2_0144_MR1 Converted 1 0 M R 77 16 1.0 30.0 0.0 1704 0.716 1.030
296 OAS2_0144 OAS2_0144_MR2 Converted 2 683 M R 79 16 1.0 30.0 0.5 1722 0.708 1.019
297 OAS2_0145 OAS2_0145_MR1 Converted 1 0 F R 68 16 3.0 30.0 0.0 1298 0.799 1.352
298 OAS2_0145 OAS2_0145_MR2 Converted 2 1707 F R 73 16 3.0 29.0 0.5 1287 0.771 1.364
346 OAS2_0176 OAS2_0176_MR1 Converted 1 0 M R 84 16 2.0 30.0 0.0 1404 0.710 1.250
347 OAS2_0176 OAS2_0176_MR2 Converted 2 774 M R 87 16 2.0 30.0 0.0 1398 0.696 1.255
348 OAS2_0176 OAS2_0176_MR3 Converted 3 1631 M R 89 16 2.0 30.0 0.5 1408 0.679 1.246

All those classified as Converted were Nondemented on their first visit and Demented on the final visit according to the data card.

We can hence resolve this category into Nondemented (first visit) and Demented (last visit), dropping nine records which lie between the first and final visits.

In [12]:
nondemented = [33,36,57,81,114,194,218,245,261,271,273,295,297,346]
demented = [35,38,59,83,115,195,220,246,265,272,274,296,298,348]
drop = [34,37,58,82,219,262,263,264,347]
In [13]:
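# Assign through .loc on the DataFrame itself. Chained assignment such as
# dementia_df.Group.iloc[n] = ... does not modify dementia_df under Copy-on-Write.
dementia_df.loc[nondemented, 'Group'] = 'Nondemented'
dementia_df.loc[demented, 'Group'] = 'Demented'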
In [14]:
dementia_df = dementia_df.drop(index=drop)
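
The hard-coded index lists above could also be derived from the data. The following is a minimal sketch on a made-up toy frame, assuming that the Visit number identifies each subject's first and last record; it is an alternative to the manual lists, not what the notebook ran.

```python
import pandas as pd

# Toy stand-in for the real data; the values below are made up for illustration.
toy = pd.DataFrame({
    "Subject ID": ["A", "A", "A", "B", "B", "C", "C"],
    "Group": ["Converted"] * 5 + ["Nondemented"] * 2,
    "Visit": [1, 3, 4, 1, 2, 1, 2],
})

converted = toy[toy["Group"] == "Converted"]
# Index labels of each converted subject's first and last recorded visit.
first_idx = converted.groupby("Subject ID")["Visit"].idxmin().to_numpy()
last_idx = converted.groupby("Subject ID")["Visit"].idxmax().to_numpy()
# Every other Converted record lies between the two.
middle_idx = converted.index.difference(first_idx).difference(last_idx)

toy.loc[first_idx, "Group"] = "Nondemented"
toy.loc[last_idx, "Group"] = "Demented"
toy = toy.drop(index=middle_idx)

print(toy["Group"].tolist())
```

Deriving the indices this way avoids the lists going stale if the input file is reordered.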

Now we can drop the unneeded columns.

In [15]:
dementia_df = dementia_df.drop(columns = ['Subject ID','MRI ID'])

Now we can visualise the target following these changes.

In [16]:
sns.countplot(x=dementia_df['Group'], palette=palette)
Out[16]:
<Axes: xlabel='Group', ylabel='count'>
[Figure: count of records per Group after resolving the Converted records]
In [17]:
dementia_df.Group.value_counts()
Out[17]:
Group
Nondemented    204
Demented       160
Name: count, dtype: int64

Data Types¶


We will convert all categorical features to numerical values to make them easier to work with for now.

In [18]:
dementia_df['Group'] = dementia_df['Group'].map({'Nondemented': 0, 'Demented': 1})
dementia_df['M/F'] = dementia_df['M/F'].map({'M': 0, 'F': 1})
dementia_df['Hand'] = dementia_df['Hand'].map({'R': 0, 'L': 1})
In [19]:
dementia_df['Group'] = dementia_df['Group'].astype(int)
dementia_df['M/F'] = dementia_df['M/F'].astype(int)
dementia_df['Hand'] = dementia_df['Hand'].astype(int)
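
One thing worth checking with this kind of mapping is that every category is covered, because `map` turns unmapped values into NaN and `astype(int)` then fails. A small sketch with made-up values, where the lowercase 'm' is a hypothetical stray category:

```python
import pandas as pd

# Made-up values; 'm' stands in for an unexpected category.
sex = pd.Series(["M", "F", "m", "F"])
mapped = sex.map({"M": 0, "F": 1})

# Unmapped categories become NaN, which would make astype(int) raise.
unmapped = sex[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list here means the integer conversion is safe.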

Missing Values¶


In [20]:
dementia_df.isnull().sum()
Out[20]:
Group        0
Visit        0
MR Delay     0
M/F          0
Hand         0
Age          0
EDUC         0
SES         19
MMSE         2
CDR          0
eTIV         0
nWBV         0
ASF          0
dtype: int64

Display the rows with missing data to see if there is a pattern.

In [21]:
dementia_df[dementia_df.isnull().any(axis=1)]
Out[21]:
Group Visit MR Delay M/F Hand Age EDUC SES MMSE CDR eTIV nWBV ASF
2 1 1 0 0 0 75 12 NaN 23.0 0.5 1678 0.736 1.046
3 1 2 560 0 0 76 12 NaN 28.0 0.5 1738 0.713 1.010
4 1 3 1895 0 0 80 12 NaN 22.0 0.5 1698 0.701 1.034
10 1 1 0 0 0 71 16 NaN 28.0 0.5 1357 0.748 1.293
11 1 3 518 0 0 73 16 NaN 27.0 1.0 1365 0.727 1.286
12 1 4 1281 0 0 75 16 NaN 27.0 1.0 1372 0.710 1.279
134 1 1 0 1 0 80 12 NaN 30.0 0.5 1430 0.737 1.228
135 1 2 490 1 0 81 12 NaN 27.0 0.5 1453 0.721 1.208
207 1 1 0 1 0 80 12 NaN 27.0 0.5 1475 0.762 1.190
208 1 2 807 1 0 83 12 NaN 23.0 0.5 1484 0.750 1.183
237 1 1 0 1 0 76 12 NaN 27.0 0.5 1316 0.727 1.333
238 1 2 570 1 0 78 12 NaN 27.0 1.0 1309 0.709 1.341
322 1 1 0 0 0 76 12 NaN 27.0 0.5 1557 0.705 1.127
323 1 2 552 0 0 78 12 NaN 29.0 1.0 1569 0.704 1.119
356 1 1 0 1 0 74 12 NaN 26.0 0.5 1171 0.733 1.499
357 1 2 539 1 0 75 12 NaN NaN 1.0 1169 0.742 1.501
358 1 3 1107 1 0 77 12 NaN NaN 1.0 1159 0.733 1.515
359 1 1 0 0 0 73 12 NaN 23.0 0.5 1661 0.698 1.056
360 1 2 776 0 0 75 12 NaN 20.0 0.5 1654 0.696 1.061

We have already dropped nine rows, so another 19 would be too many to drop.

All rows with missing values are from demented patients, so we cannot use basic imputation as it would introduce bias.

Imputation by group could be used, but this may over-simplify the data and dilute context-specific patterns.

Therefore, K-Nearest-Neighbours imputation will be used.

In [22]:
from sklearn.impute import KNNImputer
In [23]:
imputer = KNNImputer(n_neighbors=5)
In [24]:
dementia_df = pd.DataFrame(imputer.fit_transform(dementia_df), columns=dementia_df.columns)
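
KNNImputer measures distance on the raw feature values, so a large-scale column such as eTIV can dominate the choice of neighbours. The sketch below shows one possible variant that standardises the features first and then maps the result back to the original units. The toy rows are made up, and the use of StandardScaler and of `n_neighbors=2` are assumptions for illustration, not what the notebook ran.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Made-up rows using three of the dataset's column names.
toy = pd.DataFrame({
    "EDUC": [12, 16, 12, 14, 18, 12],
    "SES": [np.nan, 1.0, 4.0, 2.0, 1.0, np.nan],
    "eTIV": [1678, 1357, 1430, 1987, 1264, 1171],
})

scaler = StandardScaler()            # NaNs are ignored when fitting
scaled = scaler.fit_transform(toy)   # and are passed through by transform
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=toy.columns,
    index=toy.index,                 # keep the original row labels
)
print(imputed.round(2))
```

Passing `index=` keeps the row labels of the input frame; without it the new DataFrame gets a fresh 0..n-1 index.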

Check that there are no more missing values.

In [25]:
dementia_df.isnull().sum()
Out[25]:
Group       0
Visit       0
MR Delay    0
M/F         0
Hand        0
Age         0
EDUC        0
SES         0
MMSE        0
CDR         0
eTIV        0
nWBV        0
ASF         0
dtype: int64

Synthetic Minority Over-sampling Technique (SMOTE)¶


In [26]:
from imblearn.over_sampling import SMOTE
In [27]:
X = dementia_df.drop('Group', axis=1)
y = dementia_df['Group']
In [28]:
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
In [29]:
dementia_df = pd.DataFrame(X_resampled, columns=X.columns)
dementia_df['Group'] = y_resampled
In [30]:
sns.countplot(x=dementia_df['Group'], palette=palette)
Out[30]:
<Axes: xlabel='Group', ylabel='count'>
[Figure: count of records per Group after SMOTE]
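
SMOTE balances the classes by creating synthetic minority rows, each placed on the line segment between a minority sample and one of its nearest minority neighbours. The following is a simplified illustration of that interpolation step with made-up values; it is not imblearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up minority-class samples with columns (M/F, Age).
sample = np.array([0.0, 70.0])
neighbour = np.array([1.0, 80.0])

# SMOTE-style interpolation: a random point on the segment between them.
lam = rng.uniform(0, 1)
synthetic = sample + lam * (neighbour - sample)
print(synthetic)
```

Because of this interpolation, encoded binary columns such as M/F can take fractional values in the synthetic rows.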

Data Distribution & Correlations¶


Skewness Analysis¶


In [31]:
dementia_df.skew()
Out[31]:
Visit       1.055535
MR Delay    1.019545
M/F        -0.237857
Hand        0.000000
Age         0.165902
EDUC        0.055679
SES         0.116194
MMSE       -2.121994
CDR         1.087784
eTIV        0.510586
nWBV        0.248478
ASF         0.094090
Group       0.000000
dtype: float64

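We can see that EDUC and ASF are nearly symmetrically distributed, while most other variables show slight to moderate skewness. Hand has a skewness of exactly 0 because it is constant, and Group has a skewness of 0 because SMOTE has balanced the two classes.

MMSE is highly negatively skewed, while CDR, Visit, and MR Delay are positively skewed with values just above 1.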
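
For reference, the values reported by `skew()` are the adjusted Fisher-Pearson skewness. The sketch below recomputes it by hand on made-up scores; the helper name `adjusted_skew` is hypothetical.

```python
import numpy as np
import pandas as pd

def adjusted_skew(values):
    """Adjusted Fisher-Pearson skewness, the estimator pandas' skew() reports."""
    x = np.asarray(values, dtype=float)
    n = x.size
    s = x.std(ddof=1)  # sample standard deviation
    return n / ((n - 1) * (n - 2)) * np.sum(((x - x.mean()) / s) ** 3)

# Made-up scores bunched near a ceiling of 30 with a few low values.
scores = pd.Series([30, 30, 29, 29, 28, 27, 20, 15])
print(adjusted_skew(scores), scores.skew())

# A constant column has no spread, and pandas reports its skewness as 0.
print(pd.Series([0, 0, 0, 0]).skew())
```

A negative value means the longer tail is on the low side, as with scores that cluster near a maximum.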

We can compare this to the skewness of features for demented and non-demented patients specifically.

In [32]:
demented = dementia_df[dementia_df['Group'] == 1]
non_demented = dementia_df[dementia_df['Group'] == 0]
In [33]:
skew_comparison = pd.DataFrame({
    'Overall': dementia_df.skew(),
    'Non-Demented': non_demented.skew(),
    'Demented': demented.skew()
})
In [34]:
print(skew_comparison)
           Overall  Non-Demented  Demented
Visit     1.055535      1.028822  1.059986
MR Delay  1.019545      0.807749  1.276792
M/F      -0.237857     -0.784295  0.258019
Hand      0.000000      0.000000  0.000000
Age       0.165902      0.071220  0.301505
EDUC      0.055679      0.126381 -0.002939
SES       0.116194      0.342371 -0.112642
MMSE     -2.121994     -1.084445 -1.524320
CDR       1.087784      8.123034  2.054137
eTIV      0.510586      0.489330  0.516385
nWBV      0.248478      0.021469  0.277464
ASF       0.094090      0.060733  0.151353
Group     0.000000      0.000000  0.000000

We can plot this data to more easily visualise it.

To do this we need to ensure the skew_comparison DataFrame has a column for variable names.

In [35]:
skew_comparison = skew_comparison.reset_index().rename(columns={'index': 'Variable'})

And then reshape the DataFrame.

In [36]:
skew_comparison = pd.melt(skew_comparison, id_vars='Variable', var_name='Group', value_name='Skewness')
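
To make the reshaping step concrete, here is a small sketch with made-up skewness values showing what `reset_index` followed by `pd.melt` produces.

```python
import pandas as pd

# Made-up skewness values, indexed by variable name like skew_comparison.
wide = pd.DataFrame(
    {"Overall": [0.2, -2.1], "Demented": [0.3, -1.5]},
    index=["Age", "MMSE"],
)

# Move the index into a 'Variable' column, then stack the group columns.
long = wide.reset_index().rename(columns={"index": "Variable"})
long = pd.melt(long, id_vars="Variable", var_name="Group", value_name="Skewness")
print(long)
```

Each row of the long frame holds one (Variable, Group, Skewness) triple, which is the layout `sns.barplot` expects for its `x`, `hue`, and `y` arguments.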
In [37]:
plt.figure(figsize=(14, 8))
sns.barplot(x='Variable', y='Skewness', hue='Group', data=skew_comparison)
plt.title('Comparison of Skewness Between Demented and Non-Demented Groups')
plt.xlabel('Variable')
plt.ylabel('Skewness')
plt.xticks(rotation=45)
plt.legend(title='Group')
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: grouped bar chart of skewness per variable for Overall, Non-Demented, and Demented]

The skewness comparison describes the shape of each distribution within the Non-Demented and Demented groups; it does not by itself show which group has higher or lower values. MMSE is negatively skewed in both groups (-1.08 and -1.52), meaning most scores sit near the maximum of 30 with a tail of lower scores, and that tail is longer in the Demented group. CDR is positively skewed in both groups, and extremely so in the Non-Demented group (8.12), where almost every record scores 0.

Age is close to symmetric in both groups (0.07 and 0.30), so the skewness gives no indication that either group is older. MR Delay is more right-skewed in the Demented group (1.28 against 0.81), which points to a longer tail of large delays rather than longer delays on average. M/F is a binary feature (M = 0, F = 1), so its negative skew in the Non-Demented group (-0.78) reflects a higher proportion of females, while the mildly positive skew in the Demented group (0.26) suggests a slight majority of males.

SES is mildly right-skewed in the Non-Demented group (0.34) and close to symmetric in the Demented group (-0.11), so Non-Demented records are concentrated at the lower SES values. Differences in level between the two groups are better judged from the correlations with Group than from skewness.
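
Skewness and average level are separate properties, so a claim such as "this group is older" is more directly checked with a per-group mean or median. A minimal sketch with made-up ages, constructed so that one group is more skewed without being older on average:

```python
import pandas as pd

# Made-up ages: group 1 is more skewed but not older on average.
toy = pd.DataFrame({
    "Group": [0, 0, 0, 0, 1, 1, 1, 1],
    "Age": [70, 75, 80, 85, 72, 73, 74, 91],
})

summary = toy.groupby("Group")["Age"].agg(["mean", "median", "skew"])
print(summary)
```

Both groups have the same mean here, while only group 1 has a noticeable positive skew.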

Histogram¶


In [38]:
dementia_df.hist(figsize=(25,20))
Out[38]:
array([[<Axes: title={'center': 'Visit'}>,
        <Axes: title={'center': 'MR Delay'}>,
        <Axes: title={'center': 'M/F'}>,
        <Axes: title={'center': 'Hand'}>],
       [<Axes: title={'center': 'Age'}>,
        <Axes: title={'center': 'EDUC'}>,
        <Axes: title={'center': 'SES'}>,
        <Axes: title={'center': 'MMSE'}>],
       [<Axes: title={'center': 'CDR'}>,
        <Axes: title={'center': 'eTIV'}>,
        <Axes: title={'center': 'nWBV'}>,
        <Axes: title={'center': 'ASF'}>],
       [<Axes: title={'center': 'Group'}>, <Axes: >, <Axes: >, <Axes: >]],
      dtype=object)
[Figure: histograms of every column in dementia_df]

As there is no variability in the 'Hand' feature, we will drop this too.

In [39]:
dementia_df = dementia_df.drop(columns='Hand')

Correlations¶


We can now check the correlations between features in the dataset.

In [40]:
dementia_corr = dementia_df.copy().corr()
dementia_corr['Group'].sort_values(ascending = False)
Out[40]:
Group       1.000000
CDR         0.845958
SES         0.159284
ASF         0.028509
Age        -0.008147
eTIV       -0.037327
Visit      -0.037486
MR Delay   -0.068232
EDUC       -0.244451
M/F        -0.251184
nWBV       -0.331096
MMSE       -0.591501
Name: Group, dtype: float64

We can plot this on a heatmap.

In [41]:
plt.figure(figsize=(20,20))
sns.heatmap(dementia_corr, annot=True, cmap=gradient_palette)
plt.show()
[Figure: annotated heatmap of dementia_corr]

The correlation analysis shows that CDR has the strongest positive correlation with Group (0.85). This is expected, since CDR is itself a clinical rating of dementia and therefore closely tracks the label. MMSE shows a moderately strong negative correlation (-0.59), with lower scores associated with dementia, making it another key predictor. nWBV is also negatively correlated with Group (-0.33), suggesting that lower brain volume may be linked to dementia.

EDUC shows a weak negative correlation (-0.24), implying that fewer years of education could be associated with a higher likelihood of dementia. M/F (M = 0, F = 1) has a similar weak negative correlation (-0.25), indicating a somewhat higher proportion of males in the demented group. SES shows a weak positive correlation (0.16), so higher SES values are slightly more common in the demented group; assuming SES is coded so that lower values mean higher status, this weakly links higher socioeconomic status to the non-demented group. Other variables like Age, eTIV, Visit, MR Delay, and ASF have correlations close to zero, suggesting little linear association with the label in this dataset.
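
Because Group is binary, each Pearson correlation with it is a scaled difference between the two group means (the point-biserial correlation). The sketch below checks this equivalence on made-up scores.

```python
import numpy as np
import pandas as pd

# Made-up scores for a binary target (0/1).
toy = pd.DataFrame({
    "Group": [0, 0, 0, 0, 1, 1, 1],
    "MMSE": [30, 29, 30, 28, 24, 26, 21],
})

pearson = toy["MMSE"].corr(toy["Group"])

# Point-biserial form: scaled difference between the two group means.
m1 = toy.loc[toy["Group"] == 1, "MMSE"].mean()
m0 = toy.loc[toy["Group"] == 0, "MMSE"].mean()
p = toy["Group"].mean()        # share of rows with Group == 1
s = toy["MMSE"].std(ddof=0)    # population standard deviation
point_biserial = (m1 - m0) / s * np.sqrt(p * (1 - p))

print(pearson, point_biserial)
```

A negative value therefore means the feature's mean is lower in the Group = 1 rows than in the Group = 0 rows.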

In [42]:
important_features = ['Group', 'EDUC', 'MMSE', 'CDR', 'nWBV']

We can also visualise the important features in a pairplot.

In [43]:
sns.pairplot(dementia_df[important_features], hue='Group', palette=palette)
Out[43]:
<seaborn.axisgrid.PairGrid at 0x7feee7973380>
[Figure: pairplot of EDUC, MMSE, CDR, and nWBV coloured by Group]

And finally let's shuffle and save the processed dataset.

In [44]:
dementia_df = dementia_df.sample(frac=1, random_state=42).reset_index(drop=True)  # fixed seed so the shuffle is reproducible
In [45]:
dementia_df.to_csv('data/dementia_data_processed.csv', index=False)

Parkinson's Disease Dataset¶


Import Dataset & Examine¶


Import Dataset¶


In [46]:
parkinsons_df = pd.read_csv('data/parkinsons_data-VOICE-features.csv')
In [47]:
parkinsons_df.rename(columns={'name': 'Name', 'status': 'Status'}, inplace=True)

Dataset Info & Structure¶


In [48]:
print(parkinsons_df.shape)
(195, 24)
In [49]:
print(parkinsons_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 17  Status            195 non-null    int64  
 18  RPDE              195 non-null    float64
 19  DFA               195 non-null    float64
 20  spread1           195 non-null    float64
 21  spread2           195 non-null    float64
 22  D2                195 non-null    float64
 23  PPE               195 non-null    float64
dtypes: float64(22), int64(1), object(1)
memory usage: 36.7+ KB
None
In [50]:
parkinsons_df.head()
Out[50]:
Name MDVP:Fo(Hz) MDVP:Fhi(Hz) MDVP:Flo(Hz) MDVP:Jitter(%) MDVP:Jitter(Abs) MDVP:RAP MDVP:PPQ Jitter:DDP MDVP:Shimmer ... Shimmer:DDA NHR HNR Status RPDE DFA spread1 spread2 D2 PPE
0 phon_R01_S01_1 119.992 157.302 74.997 0.00784 0.00007 0.00370 0.00554 0.01109 0.04374 ... 0.06545 0.02211 21.033 1 0.414783 0.815285 -4.813031 0.266482 2.301442 0.284654
1 phon_R01_S01_2 122.400 148.650 113.819 0.00968 0.00008 0.00465 0.00696 0.01394 0.06134 ... 0.09403 0.01929 19.085 1 0.458359 0.819521 -4.075192 0.335590 2.486855 0.368674
2 phon_R01_S01_3 116.682 131.111 111.555 0.01050 0.00009 0.00544 0.00781 0.01633 0.05233 ... 0.08270 0.01309 20.651 1 0.429895 0.825288 -4.443179 0.311173 2.342259 0.332634
3 phon_R01_S01_4 116.676 137.871 111.366 0.00997 0.00009 0.00502 0.00698 0.01505 0.05492 ... 0.08771 0.01353 20.644 1 0.434969 0.819235 -4.117501 0.334147 2.405554 0.368975
4 phon_R01_S01_5 116.014 141.781 110.655 0.01284 0.00011 0.00655 0.00908 0.01966 0.06425 ... 0.10470 0.01767 19.649 1 0.417356 0.823484 -3.747787 0.234513 2.332180 0.410335

5 rows × 24 columns

Statistical Summary¶


In [51]:
parkinsons_df.describe().T
Out[51]:
count mean std min 25% 50% 75% max
MDVP:Fo(Hz) 195.0 154.228641 41.390065 88.333000 117.572000 148.790000 182.769000 260.105000
MDVP:Fhi(Hz) 195.0 197.104918 91.491548 102.145000 134.862500 175.829000 224.205500 592.030000
MDVP:Flo(Hz) 195.0 116.324631 43.521413 65.476000 84.291000 104.315000 140.018500 239.170000
MDVP:Jitter(%) 195.0 0.006220 0.004848 0.001680 0.003460 0.004940 0.007365 0.033160
MDVP:Jitter(Abs) 195.0 0.000044 0.000035 0.000007 0.000020 0.000030 0.000060 0.000260
MDVP:RAP 195.0 0.003306 0.002968 0.000680 0.001660 0.002500 0.003835 0.021440
MDVP:PPQ 195.0 0.003446 0.002759 0.000920 0.001860 0.002690 0.003955 0.019580
Jitter:DDP 195.0 0.009920 0.008903 0.002040 0.004985 0.007490 0.011505 0.064330
MDVP:Shimmer 195.0 0.029709 0.018857 0.009540 0.016505 0.022970 0.037885 0.119080
MDVP:Shimmer(dB) 195.0 0.282251 0.194877 0.085000 0.148500 0.221000 0.350000 1.302000
Shimmer:APQ3 195.0 0.015664 0.010153 0.004550 0.008245 0.012790 0.020265 0.056470
Shimmer:APQ5 195.0 0.017878 0.012024 0.005700 0.009580 0.013470 0.022380 0.079400
MDVP:APQ 195.0 0.024081 0.016947 0.007190 0.013080 0.018260 0.029400 0.137780
Shimmer:DDA 195.0 0.046993 0.030459 0.013640 0.024735 0.038360 0.060795 0.169420
NHR 195.0 0.024847 0.040418 0.000650 0.005925 0.011660 0.025640 0.314820
HNR 195.0 21.885974 4.425764 8.441000 19.198000 22.085000 25.075500 33.047000
Status 195.0 0.753846 0.431878 0.000000 1.000000 1.000000 1.000000 1.000000
RPDE 195.0 0.498536 0.103942 0.256570 0.421306 0.495954 0.587562 0.685151
DFA 195.0 0.718099 0.055336 0.574282 0.674758 0.722254 0.761881 0.825288
spread1 195.0 -5.684397 1.090208 -7.964984 -6.450096 -5.720868 -5.046192 -2.434031
spread2 195.0 0.226510 0.083406 0.006274 0.174351 0.218885 0.279234 0.450493
D2 195.0 2.381826 0.382799 1.423287 2.099125 2.361532 2.636456 3.671155
PPE 195.0 0.206552 0.090119 0.044539 0.137451 0.194052 0.252980 0.527367
In [52]:
print(f"Number of unique names: {parkinsons_df['Name'].nunique()}")
Number of unique names: 195

Preparing the Data¶


Target Examination¶


In [53]:
sns.countplot(x=parkinsons_df['Status'], palette=palette)
Out[53]:
<Axes: xlabel='Status', ylabel='count'>
[Figure: count of records per Status]
In [54]:
parkinsons_df.Status.value_counts()
Out[54]:
Status
1    147
0     48
Name: count, dtype: int64

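Each Name is unique, but the names identify recordings rather than patients: the first five rows (phon_R01_S01_1 to phon_R01_S01_5) appear to be repeated recordings of the same subject. The column is not a feature, so we remove it, keeping in mind that rows from the same subject are not independent.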

In [55]:
parkinsons_df = parkinsons_df.drop(columns=['Name'])
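
If a per-subject identifier is needed later, for example to keep one subject's recordings together, it would have to be derived before Name is dropped. The sketch below assumes that the 'Sxx' token in each name identifies the speaker; the first three names come from the head() output above and the S02 names are hypothetical.

```python
import pandas as pd

# First three names are taken from the head() output; the S02 names are
# hypothetical and only follow the same apparent pattern.
names = pd.Series([
    "phon_R01_S01_1", "phon_R01_S01_2", "phon_R01_S01_3",
    "phon_R01_S02_1", "phon_R01_S02_2",
])

# Assumption: the 'Sxx' token identifies the speaker and the trailing
# number identifies the recording.
subject = names.str.extract(r"_(S\d+)_", expand=False)
print(subject.nunique(), "subjects across", names.nunique(), "recordings")
```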

Data Types¶


As we saw from the dataset info, the only non-numerical column has been dropped, so we do not need to change any datatypes for this dataset.

Missing Values¶


In [56]:
parkinsons_df.isnull().sum()
Out[56]:
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
Status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64

As we can see, there are no missing values in this dataset, so we do not need to do anything here.

Synthetic Minority Over-sampling Technique (SMOTE)¶


In [57]:
X = parkinsons_df.drop('Status', axis=1)
y = parkinsons_df['Status'] 
In [58]:
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
In [59]:
parkinsons_df = pd.DataFrame(X_resampled, columns=X.columns)
parkinsons_df['Status'] = y_resampled
In [60]:
sns.countplot(x=parkinsons_df['Status'], palette=palette)
Out[60]:
<Axes: xlabel='Status', ylabel='count'>
[Figure: count of records per Status after SMOTE]

Data Distribution & Correlations¶


Skewness Analysis¶


In [61]:
parkinsons_df.skew()
Out[61]:
MDVP:Fo(Hz)         0.327985
MDVP:Fhi(Hz)        2.195724
MDVP:Flo(Hz)        0.857985
MDVP:Jitter(%)      3.469224
MDVP:Jitter(Abs)    2.875020
MDVP:RAP            3.891128
MDVP:PPQ            3.575468
Jitter:DDP          3.892298
MDVP:Shimmer        2.187223
MDVP:Shimmer(dB)    2.523809
Shimmer:APQ3        2.082909
Shimmer:APQ5        2.350899
MDVP:APQ            3.156754
Shimmer:DDA         2.082909
NHR                 4.637074
HNR                -0.434758
RPDE                0.077852
DFA                 0.142630
spread1             0.687206
spread2             0.399361
D2                  0.432476
PPE                 1.067788
Status              0.000000
dtype: float64

Most of the frequency, jitter, and shimmer measures, such as MDVP:Fhi(Hz), MDVP:Jitter(%), and MDVP:RAP, exhibit strong positive skew, indicating that most values are clustered at the lower end with a few extreme higher values. Skewness alone does not show whether a feature is useful for prediction; that is examined through the correlations with Status.

NHR shows the strongest positive skew (4.64), while HNR is the only feature with negative skew (-0.43), and only mildly so. Status has a skewness of 0 because SMOTE has balanced the two classes.

RPDE and DFA have near-zero skew, implying roughly symmetric distributions, while spread1, spread2, and D2 are mildly right-skewed.

We can compare this to the skewness of features for healthy and diseased patients specifically.

In [62]:
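# Status 1 marks Parkinson's recordings and 0 marks healthy ones, consistent
# with the correlation analysis, where a higher Status means Parkinson's.
diseased = parkinsons_df[parkinsons_df['Status'] == 1]
healthy = parkinsons_df[parkinsons_df['Status'] == 0]
In [63]:
skew_comparison = pd.DataFrame({
    'Overall': parkinsons_df.skew(),
    'Diseased': diseased.skew(),
    'Healthy': healthy.skew()
})
In [64]:
print(skew_comparison)
                   Overall  Diseased   Healthy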
MDVP:Fo(Hz)       0.327985  0.366850 -0.297601
MDVP:Fhi(Hz)      2.195724  2.803273  1.671469
MDVP:Flo(Hz)      0.857985  0.893664  0.252275
MDVP:Jitter(%)    3.469224  2.857486  2.339793
MDVP:Jitter(Abs)  2.875020  2.554055  1.171623
MDVP:RAP          3.891128  3.059989  2.079847
MDVP:PPQ          3.575468  2.791292  1.733652
Jitter:DDP        3.892298  3.061427  2.080291
MDVP:Shimmer      2.187223  1.370146  1.105675
MDVP:Shimmer(dB)  2.523809  1.716727  1.203896
Shimmer:APQ3      2.082909  1.282823  1.034464
Shimmer:APQ5      2.350899  1.477983  1.499901
MDVP:APQ          3.156754  2.376708  0.585278
Shimmer:DDA       2.082909  1.282875  1.034042
NHR               4.637074  3.903748  3.090148
HNR              -0.434758 -0.619045  0.599117
RPDE              0.077852 -0.330289  0.279983
DFA               0.142630 -0.151460  0.384390
spread1           0.687206  0.558212  0.462695
spread2           0.399361  0.163586 -0.142497
D2                0.432476  0.474484 -0.345554
PPE               1.067788  0.841617  0.862174
Status            0.000000  0.000000  0.000000

We can plot this data to more easily visualise it.

To do this we need to ensure the skew_comparison DataFrame has a column for variable names.

In [65]:
skew_comparison = skew_comparison.reset_index().rename(columns={'index': 'Variable'})

And then reshape the DataFrame.

In [66]:
skew_comparison = pd.melt(skew_comparison, id_vars='Variable', var_name='Status', value_name='Skewness')
In [67]:
plt.figure(figsize=(14, 8))
sns.barplot(x='Variable', y='Skewness', hue='Status', data=skew_comparison)
plt.title("Comparison of Skewness by Parkinson's Status")
plt.xlabel('Variable')
plt.ylabel('Skewness')
plt.xticks(rotation=45)
plt.legend(title='Status')
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: grouped bar chart of skewness per variable, by Status]

The skewness comparison for the Parkinson's dataset is easiest to read by Status value, where Status = 1 marks Parkinson's recordings and Status = 0 marks healthy ones. MDVP:Fo(Hz) is close to symmetric in both groups (0.37 for Status = 1 and -0.30 for Status = 0), whereas MDVP:Fhi(Hz) is strongly right-skewed in both (2.80 and 1.67). The jitter measures, such as MDVP:Jitter(%) and MDVP:Jitter(Abs), are right-skewed in both groups and more so for Status = 1, meaning the Parkinson's recordings have a longer tail of large values.

The shimmer measures follow the same pattern, although the gap is smaller: MDVP:Shimmer is 1.37 against 1.11, and Shimmer:APQ5 is almost identical in the two groups (1.48 and 1.50). MDVP:APQ shows the largest difference, with a skewness of 2.38 for Status = 1 against 0.59 for Status = 0.

NHR shows strong positive skew in both groups (3.90 and 3.09).

HNR, RPDE, and DFA change sign between the groups: they are mildly left-skewed for Status = 1 and mildly right-skewed for Status = 0, while MDVP:Fo(Hz), spread2, and D2 show the opposite pattern. Status itself is constant within each group, so its skewness of 0 carries no information. The Status = 0 group was oversampled by SMOTE from 48 to 147 rows, so most of its rows are synthetic and its skewness values should be read with that in mind.

Histogram¶


In [68]:
parkinsons_df.hist(figsize=(25,20))
Out[68]:
array([[<Axes: title={'center': 'MDVP:Fo(Hz)'}>,
        <Axes: title={'center': 'MDVP:Fhi(Hz)'}>,
        <Axes: title={'center': 'MDVP:Flo(Hz)'}>,
        <Axes: title={'center': 'MDVP:Jitter(%)'}>,
        <Axes: title={'center': 'MDVP:Jitter(Abs)'}>],
       [<Axes: title={'center': 'MDVP:RAP'}>,
        <Axes: title={'center': 'MDVP:PPQ'}>,
        <Axes: title={'center': 'Jitter:DDP'}>,
        <Axes: title={'center': 'MDVP:Shimmer'}>,
        <Axes: title={'center': 'MDVP:Shimmer(dB)'}>],
       [<Axes: title={'center': 'Shimmer:APQ3'}>,
        <Axes: title={'center': 'Shimmer:APQ5'}>,
        <Axes: title={'center': 'MDVP:APQ'}>,
        <Axes: title={'center': 'Shimmer:DDA'}>,
        <Axes: title={'center': 'NHR'}>],
       [<Axes: title={'center': 'HNR'}>,
        <Axes: title={'center': 'RPDE'}>,
        <Axes: title={'center': 'DFA'}>,
        <Axes: title={'center': 'spread1'}>,
        <Axes: title={'center': 'spread2'}>],
       [<Axes: title={'center': 'D2'}>, <Axes: title={'center': 'PPE'}>,
        <Axes: title={'center': 'Status'}>, <Axes: >, <Axes: >]],
      dtype=object)
[Figure: histograms of every column in parkinsons_df]

Correlations¶


We can now check the correlations between features in the dataset.

In [69]:
parkinsons_corr = parkinsons_df.copy().corr()
parkinsons_corr['Status'].sort_values(ascending=False)
Out[69]:
Status              1.000000
spread1             0.660577
PPE                 0.635133
spread2             0.532955
MDVP:Shimmer        0.484928
MDVP:APQ            0.483659
Shimmer:APQ5        0.468132
MDVP:Shimmer(dB)    0.463494
Shimmer:APQ3        0.459036
Shimmer:DDA         0.459025
MDVP:Jitter(Abs)    0.434579
D2                  0.404671
MDVP:PPQ            0.379451
MDVP:Jitter(%)      0.359702
MDVP:RAP            0.353253
Jitter:DDP          0.353233
RPDE                0.347461
DFA                 0.296653
NHR                 0.240929
MDVP:Fhi(Hz)       -0.183257
MDVP:Flo(Hz)       -0.419821
HNR                -0.424198
MDVP:Fo(Hz)        -0.427653
Name: Status, dtype: float64

We can plot this on a heatmap.

In [70]:
plt.figure(figsize=(20,20))
sns.heatmap(parkinsons_corr, annot=True, cmap=gradient_palette)
plt.show()
[Figure: annotated heatmap of parkinsons_corr]

The correlation analysis of the Parkinson's dataset reveals several patterns related to Status, which is a binary label (1 for Parkinson's, 0 for healthy) and does not measure severity. spread1 and PPE show the strongest positive correlations with Status (0.66 and 0.64); both are nonlinear measures of fundamental frequency variation, so greater pitch variation is associated with Parkinson's. spread2 also shows a moderate positive correlation (0.53).

The shimmer features (MDVP:Shimmer, MDVP:APQ, Shimmer:APQ5, MDVP:Shimmer(dB), Shimmer:APQ3, and Shimmer:DDA) have moderate positive correlations with Status (0.46 to 0.48), implying that larger variation in amplitude is associated with Parkinson's.

MDVP:Jitter(Abs) and D2 show somewhat weaker positive correlations (0.43 and 0.40), and the remaining jitter measures, RPDE, DFA, and NHR are weaker still (0.24 to 0.38).

On the other hand, MDVP:Fo(Hz), HNR, and MDVP:Flo(Hz) show moderate negative correlations with Status of very similar size (-0.43, -0.42, and -0.42), suggesting that a lower fundamental frequency and a lower harmonics-to-noise ratio are associated with Parkinson's. MDVP:Fhi(Hz) is only weakly negatively correlated (-0.18).

In summary, spread1, PPE, and spread2 have the strongest correlations with Status, followed by the shimmer features, while HNR, MDVP:Fo(Hz), and MDVP:Flo(Hz) show moderate negative correlations. These are linear associations computed on the SMOTE-balanced data, so they indicate which features separate the two classes rather than how severe the disease is.

In [71]:
important_features = ['Status', 'spread1', 'PPE', 'MDVP:Shimmer', 'MDVP:APQ', 'Shimmer:APQ5', 'Shimmer:DDA', 'MDVP:Shimmer(dB)', 'HNR', 'MDVP:Fo(Hz)', 'MDVP:Flo(Hz)']
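
Several of the selected features may carry almost the same information. In the outputs above, Shimmer:APQ3 and Shimmer:DDA have identical skewness (2.082909) and nearly identical correlations with Status, and in the head() rows Jitter:DDP is roughly three times MDVP:RAP. The sketch below lists highly correlated pairs from a correlation matrix; the toy data and the 0.95 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
apq3 = rng.uniform(0.005, 0.05, size=50)

# Made-up data: 'Shimmer:DDA' is built as an exact multiple of
# 'Shimmer:APQ3' purely to illustrate a redundant pair.
toy = pd.DataFrame({
    "Shimmer:APQ3": apq3,
    "Shimmer:DDA": 3 * apq3,
    "HNR": rng.normal(22, 4, size=50),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
redundant = pairs[pairs > 0.95]
print(redundant)
```

Running the same check on `parkinsons_corr` would show which of the shimmer and jitter features could be dropped without losing much information.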

We can also visualise the important features in a pairplot.

In [72]:
sns.pairplot(parkinsons_df[important_features], hue='Status', palette=palette)
Out[72]:
<seaborn.axisgrid.PairGrid at 0x7feee75082f0>
[Figure: pairplot of the selected important features coloured by Status]

And finally let's shuffle and save the processed dataset.

In [73]:
parkinsons_df = parkinsons_df.sample(frac=1, random_state=42).reset_index(drop=True)  # fixed seed so the shuffle is reproducible
In [74]:
parkinsons_df.to_csv('data/parkinsons_data_processed.csv', index=False)